Record Linkage Modeling in Federal Statistical Databases

Author

  • Michael D. Larsen
Abstract

Record linkage (e.g., Fellegi and Sunter 1969; Newcombe et al. 1959) involves comparing two or more files on the same population in order to unduplicate records and merge files. Record linkage is used in many applications, including population size estimation at the U.S. Census Bureau (Winkler 1994, 1995; Jaro 1989, 1995), epidemiology and medical studies (Newcombe 1988; Gill 1997), sociological studies (Belin et al. 2004), survey frame improvement, and, more recently, counterterrorism (Gomatam and Larsen 2004). See also Alvey and Jamerson (1997) and references therein. Latent class models (McCutcheon 1987) and mixture models (McLachlan and Peel 2000) have been used to model the data arising from comparing records in two files (Larsen and Rubin 2001; Winkler 1988, 1994, 1995; Jaro 1989, 1995). Although successful in many applications (Alvey and Jamerson 1997), the models used in these applications have not accounted for all restrictions in the data. In particular, the requirement that each record on one file have at most a single matching record on the other file (“one-to-one matching”) has been imposed post hoc, using a one-to-one linear-sum assignment procedure (Burkard and Derigs 1980; Jaro 1989) to choose individual links. The one-to-one assignment procedure can effectively eliminate many candidate links that have some degree of similarity but are actually nonlinks. Experience from previous record linkage operations has been used informally to select models (Larsen and Rubin 2001) and restrict parameters (Winkler 1989, 1994). Bayesian approaches to record linkage have been suggested by Larsen (1999a, 2002, 2004, 2005), Fortini et al. (2000, 2002), and McGlinchy (2004). A procedure is described here that explicitly uses the one-to-one matching assumption and allows parameter values to vary by file block, where a block is a subset of the data being linked. The approach will necessarily be Bayesian, because of the relatively small sample sizes within blocks and the difficulty of calculating expectations under complex restrictions on unobserved data.
This article is organized as follows. Record linkage is introduced in Section 2. Bayesian record linkage algorithms are presented in Section 3. Section 4 concludes and discusses future work. Computational procedures are presented in Appendices A and B. References are given in Section 5.
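
The post hoc one-to-one step mentioned in the abstract can be illustrated with a small sketch. This is not the article's procedure: it assumes a matrix of match weights has already been computed for one block, and the function name `one_to_one_links`, the weights, and the threshold are hypothetical. It uses a brute-force search, which is only practical for very small blocks.

```python
from itertools import permutations


def one_to_one_links(weights, threshold=0.0):
    """Choose the one-to-one pairing of rows (file A) and columns (file B)
    with the largest total weight, then keep only the pairs whose weight
    exceeds `threshold`. Brute force: suitable only for tiny blocks."""
    n_a, n_b = len(weights), len(weights[0])
    if n_a > n_b:
        raise ValueError("this sketch assumes file A is no larger than file B")
    best_total, best_cols = float("-inf"), None
    # Each permutation assigns a distinct record of B to every record of A.
    for cols in permutations(range(n_b), n_a):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    # Dropping low-weight pairs gives "at most one" match per record.
    return [(i, j) for i, j in enumerate(best_cols) if weights[i][j] > threshold]


# Hypothetical match weights for a block with 3 records in each file.
weights = [
    [9.0, 6.0, -3.0],
    [7.0, -2.0, -4.0],
    [-3.0, -1.0, -5.0],
]
links = one_to_one_links(weights)
print(links)  # [(0, 1), (1, 0)]
```

In this made-up block the pair (0, 0) has the largest single weight, yet it is eliminated because pairing record 0 with 1 and record 1 with 0 gives a larger total; the third record in each file is left unlinked. For realistic block sizes a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force loop.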


Similar resources

Privacy-Preserving Record Linkage

Record linkage has a long tradition in both the statistical and the computer science literature. We survey current approaches to the record linkage problem in a privacy-aware setting and contrast these with the more traditional literature. We also identify several important open questions that pertain to private record linkage from different per-


Probabilistic Linkage of Persian Record with Missing Data

Extended Abstract. When comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. In other cases, the interest lies in detecting duplicates within a data set. The i...


A Comparison of Blocking Methods for Record Linkage

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality s...


Data Preparation for Biomedical Knowledge Domain Visualization: A Probabilistic Record Linkage and Information Fusion Approach to Citation Data

Marie B Synnestvedt, Xia Lin Ph.D. This thesis presents a methodology of data preparation with probabilistic record linkage and information fusion for improving and enriching information visualizations of biomedical citation data. The problem of record l...


Assessing Disclosure Risk for Record Linkage

An intruder seeks to match a microdata file to an external file using a record linkage technique. The identification risk is defined as the probability that a match is correct. The nature of this probability and its estimation is explored. Some connections are made to the literature on disclosure risk based on the notion of population uniqueness.




Publication date: 2010